🧾 Description

In this step, we define and use the `load_and_label_news` function to prepare our dataset for analysis and modeling. This function performs the following tasks:

1. Loads the two CSV files: 'Fake.csv' and 'True.csv'.
2. Labels the entries, assigning 0 to fake news and 1 to real news.
3. Combines both datasets into a single DataFrame for unified processing.
4. Shuffles the combined dataset to ensure a random distribution of samples, avoiding any bias from file ordering.

The resulting DataFrame is ready for exploratory data analysis, text preprocessing, and model training.
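The steps above can be sketched as follows; the function signature, file paths, and seed value are assumptions for illustration, not taken verbatim from the notebook:

```python
import pandas as pd

def load_and_label_news(fake_path="Fake.csv", true_path="True.csv", seed=45):
    """Load both CSVs, label them (0 = fake, 1 = real), combine, and shuffle."""
    fake = pd.read_csv(fake_path)
    real = pd.read_csv(true_path)
    fake["label"] = 0
    real["label"] = 1
    combined = pd.concat([fake, real], ignore_index=True)
    # Shuffle to remove the ordering bias introduced by concatenation
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Shuffling with a fixed `random_state` keeps the row order reproducible across runs.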
The dataset originally contains detailed subject labels such as 'politicsNews', 'left-news', 'worldnews', etc. These are grouped into three broader domains (as mentioned in the mail):

- Politics: political and government-related news (politicsNews, politics, left-news, Government News)
- News: general news (worldnews, News, US_News)
- Other: topics that don't fit the above categories (Middle-east)

This mapping simplifies analysis by allowing model performance to be evaluated across three interpretable domains.
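The grouping above can be expressed as a simple lookup table; the dictionary and helper below are a sketch, with the fallback to "Other" for unseen labels being an assumption:

```python
# Mapping from raw subject labels to the three broader domains described above
DOMAIN_MAP = {
    "politicsNews": "Politics",
    "politics": "Politics",
    "left-news": "Politics",
    "Government News": "Politics",
    "worldnews": "News",
    "News": "News",
    "US_News": "News",
    "Middle-east": "Other",
}

def map_domain(subject: str) -> str:
    # Any subject not in the table falls back to "Other" (an assumption)
    return DOMAIN_MAP.get(subject, "Other")
```

Applied with `df["domain"] = df["subject"].map(map_domain)`, this adds the domain column used in the later per-domain analyses.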
📌 Basic Dataset Overview Summary

- The dataset contains a total of 44,898 news articles with 5 columns: `title`, `text`, `subject`, `date`, and `label`.
- All columns are complete with no missing values.
- Data types are appropriate: text-related columns are of type `object`, and the binary target `label` is of type `int64`.
- Sample records from the beginning and end confirm a mix of fake and real news across various topics.
- A total of 209 duplicate rows were found, which may need to be removed in preprocessing.
🧹 Data Cleaning

- Removed 209 duplicate rows, reducing the dataset size from 44,898 to 44,689 records.
- Converted the `date` column from string to datetime format using `pd.to_datetime()`.
- Identified 10 rows with invalid or unparseable date entries.
- Dropped these 10 rows, resulting in a final cleaned dataset with 44,679 rows.

These cleaning steps ensure the dataset is free of duplicates and that date values are in a consistent, usable format for future temporal analysis.
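A minimal sketch of these cleaning steps, assuming a `date` column of strings (the function name is illustrative):

```python
import pandas as pd

def clean_news(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, parse dates, and drop rows with unparseable dates."""
    # Remove exact duplicate rows first
    df = df.drop_duplicates()
    # errors="coerce" turns invalid date strings into NaT instead of raising
    df = df.assign(date=pd.to_datetime(df["date"], errors="coerce"))
    # Drop the rows whose date could not be parsed
    return df.dropna(subset=["date"]).reset_index(drop=True)
```

Using `errors="coerce"` is one way to surface the invalid entries: they become `NaT` and can be counted before being dropped.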
The dataset consists of 44,679 news articles after cleaning, with a slight class imbalance: 23,468 articles (approximately 52.5%) are labeled Fake News and 21,211 (approximately 47.5%) are labeled Real News. This near-balanced distribution provides a solid foundation for building a binary classification model.
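The counts and percentages above come down to a `value_counts` call; a small helper sketch (the function name is an assumption):

```python
import pandas as pd

def class_distribution(df: pd.DataFrame, col: str = "label") -> pd.DataFrame:
    """Per-class counts and percentage shares for a column."""
    counts = df[col].value_counts()
    return pd.DataFrame({"count": counts,
                         "percent": (counts / len(df) * 100).round(2)})
```

Calling `class_distribution(df)` on the cleaned dataset reproduces the fake/real split reported above.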
The dataset categorizes news articles into three primary domains: Politics, News, and Other. The distribution of articles across these domains is as follows:

- Politics: 24,077 articles, representing approximately 53.89% of the dataset.
- News: 19,824 articles, accounting for about 44.37% of the total.
- Other: 778 articles, which make up roughly 1.74% of the dataset.

This distribution shows that the majority of articles relate to political topics, followed closely by general news. The 'Other' category comprises only a small fraction of the dataset.
News Subject Distribution - Summary

- The dataset contains news articles spanning 8 distinct subjects/topics.
- politicsNews is the most common subject, comprising about 25.1% of the dataset (11,220 articles).
- Other prominent subjects include:
  - worldnews (22.4%)
  - News (20.3%)
  - politics (15.3%)
- Smaller portions of the dataset come from:
  - left-news (10%)
  - Government News (3.5%)
  - US_News (1.75%)
  - Middle-east (1.74%)
- The total cleaned dataset consists of 44,679 news articles after removing duplicates and invalid entries.
Insights

- Most texts are short (under 1,000 words), with a few very long outliers (up to 8,000 words).
- Preprocessing focus: prioritize handling short texts (e.g., padding/truncation), but check whether long texts impact model performance.
- Outliers: verify whether extreme-length texts are noise (e.g., errors) or meaningful (e.g., long reports).
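A quick way to produce these length diagnostics is a whitespace word count per document; a minimal sketch (the helper name is an assumption):

```python
import pandas as pd

def word_counts(texts: pd.Series) -> pd.Series:
    """Whitespace-token word count per document, for length diagnostics."""
    return texts.str.split().str.len()
```

With the real data, `word_counts(df["text"]).describe()` summarizes the distribution, and `(word_counts(df["text"]) > 1000).mean()` gives the share of long outliers.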
🗓️ The news articles in the dataset span from March 31, 2015, to February 19, 2018, covering nearly three years of content. This date range provides a diverse temporal context for analyzing trends in fake and real news over time.
The dataset is divided into three main domains: News, Politics, and Other.

- In the News domain, the number of fake and real articles is roughly balanced, with 9,833 fake and 9,991 real articles.
- The Politics domain has a higher proportion of fake news (12,857 fake) compared to real news (11,220 real).
- The Other domain contains only fake news articles, totaling 778.

Overall, the Politics domain shows a stronger presence of fake news relative to real news, while the News domain is more balanced. This distribution highlights differences in news authenticity across domains, which can inform modeling and analysis strategies.
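The per-domain fake/real counts above correspond to a domain-by-label cross-tabulation; a sketch assuming `domain` and `label` columns exist:

```python
import pandas as pd

def domain_label_table(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-tabulate domain vs. label (0 = fake, 1 = real)."""
    return pd.crosstab(df["domain"], df["label"])
```

On the full dataset this yields one row per domain with fake (column 0) and real (column 1) article counts.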
The function `preprocess_text` takes raw input text and performs comprehensive cleaning and preprocessing to prepare it for natural language processing tasks. The key steps are:

1. Convert text to lowercase to standardize casing.
2. Remove content within square brackets, which often contains citations or notes.
3. Strip out URLs to eliminate irrelevant links.
4. Remove any HTML tags that may appear in the text.
5. Delete all punctuation marks to focus on the core words.
6. Replace newline characters with spaces to preserve sentence structure.
7. Remove any words containing digits to exclude numbers and alphanumeric codes.
8. Eliminate extra whitespace and trim leading/trailing spaces.
9. Tokenize the cleaned text into individual words.
10. Lemmatize each token to reduce words to their base or dictionary form.
11. Remove common English stopwords that do not add meaningful content.
12. Reassemble the cleaned tokens into a single, space-separated string.

This preprocessing pipeline reduces noise and enhances the quality of the text data for tasks such as classification, topic modeling, and other NLP applications.
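A dependency-free sketch of this pipeline is shown below. Two simplifications are assumptions: the stopword set here is a tiny illustrative sample (the actual pipeline would use NLTK's full English stopword list), and lemmatization is left as an identity step (the actual pipeline would use something like NLTK's `WordNetLemmatizer`):

```python
import re
import string

# Tiny illustrative stopword set; a real pipeline would use NLTK's full list
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in"}

def preprocess_text(text: str) -> str:
    text = text.lower()                                   # 1. standardize casing
    text = re.sub(r"\[.*?\]", " ", text)                  # 2. bracketed notes
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # 3. URLs
    text = re.sub(r"<.*?>+", " ", text)                   # 4. HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # 5. punctuation
    text = text.replace("\n", " ")                        # 6. newlines
    text = re.sub(r"\w*\d\w*", " ", text)                 # 7. words with digits
    text = re.sub(r"\s+", " ", text).strip()              # 8. extra whitespace
    tokens = text.split()                                 # 9. tokenize
    # 10-11. lemmatize (identity here, as noted above) and drop stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)                               # 12. reassemble
```

The regex steps mirror the numbered list above one-to-one; only the lemmatizer and stopword list differ from a production version.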
This code snippet prepares the dataset for machine learning modeling by performing the following steps:

- Selects the 'cleaned_text' column as the feature set (X) and the 'label' column as the target variable (y).
- Splits the dataset into training and testing subsets using an 80-20 split while preserving the class distribution (stratification), ensuring balanced representation of both classes in the train and test sets.
- Initializes a TF-IDF vectorizer to convert the raw text into numerical feature vectors that reflect term importance.
- Fits the vectorizer on the training data to learn the vocabulary and IDF values, then transforms the training text into TF-IDF weighted features.
- Uses the fitted vectorizer to transform the test set into TF-IDF feature vectors, ensuring consistent representation across training and evaluation datasets.

These steps put the textual data in a format suitable for input into machine learning algorithms.
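The steps above can be sketched as follows; the function wrapper, seed, and default `TfidfVectorizer` settings are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def build_features(texts, labels, seed=45):
    """80/20 stratified split, then TF-IDF fitted on the training split only."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocab + IDF on train
    X_test_tfidf = vectorizer.transform(X_test)        # reuse the train vocabulary
    return X_train_tfidf, X_test_tfidf, y_train, y_test, vectorizer
```

Fitting the vectorizer only on the training split avoids leaking test-set vocabulary and document frequencies into training.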
XGBoost Model Training

We train an XGBoost classifier for the binary task of distinguishing fake from real news articles. Key points of the training setup:

- Objective: 'binary:logistic', since this is a binary classification problem predicting the probability of the positive class (real news).
- Evaluation metric: logarithmic loss ('logloss') during training.
- Model complexity: a maximum tree depth of 6, balancing underfitting and overfitting.
- Learning rate: 0.1, to moderate the boosting step size and allow gradual improvement.
- Number of estimators: 100 boosting rounds to build the ensemble.
- Early stopping: training stops if there is no improvement on the validation set for 10 rounds, preventing overfitting and saving time.
- Reproducibility: a fixed random seed (`random_state=45`) ensures consistent results.

The model is trained on the TF-IDF vectorized training data, while performance is monitored on the test set for early stopping.
Why Accuracy is Chosen as the Primary Metric

Accuracy is the proportion of correctly classified samples out of all samples. In this binary classification task, the confusion matrix shows that the model correctly predicts the vast majority of both classes, with very few misclassifications:

- True Negatives: 4,680
- True Positives: 4,235
- False Positives: 14
- False Negatives: 7

With an accuracy of 99.76%, the model demonstrates excellent overall performance. Accuracy is particularly suitable here because:

- The dataset is relatively balanced and stratified during splitting, reducing bias towards one class.
- High accuracy reflects the model's ability to correctly classify both fake and real news articles.
- It provides an intuitive, easy-to-understand summary of model effectiveness for stakeholders.

However, accuracy alone can be misleading on imbalanced datasets, so we also consider precision, recall, F1-score, and ROC AUC to ensure a robust evaluation of model performance.
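The reported 99.76% can be recomputed directly from the confusion-matrix counts above, along with the other threshold-based metrics:

```python
# Confusion-matrix counts as reported above
tn, fp, fn, tp = 4680, 14, 7, 4235

accuracy = (tn + tp) / (tn + fp + fn + tp)   # share of correct predictions
precision = tp / (tp + fp)                   # of predicted-real, how many are real
recall = tp / (tp + fn)                      # of actual-real, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

All four values are in the same high-99% range here, which is consistent with the near-balanced class distribution.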
The model's accuracy was evaluated separately for each news domain: Politics, News, and Other.

Accuracy scores:

- Politics: 99.63%
- News: 99.92%
- Other: 100%

The model performs exceptionally well across all domains, with perfect accuracy on the Other domain. The slightly lower accuracy on Politics suggests that domain may contain more challenging or ambiguous samples. Domain-wise evaluation helps reveal model strengths and weaknesses across different types of news content.
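Per-domain accuracy amounts to grouping predictions by domain and averaging correctness; a minimal sketch (the helper name is an assumption):

```python
import pandas as pd

def per_domain_accuracy(domains, y_true, y_pred) -> pd.Series:
    """Accuracy computed separately within each domain group."""
    df = pd.DataFrame({"domain": list(domains),
                       "correct": [t == p for t, p in zip(y_true, y_pred)]})
    # Mean of a boolean column is exactly the accuracy within each group
    return df.groupby("domain")["correct"].mean()
```

Applied to the test-set predictions and their domain labels, this yields the three scores listed above.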
These values show the importance of the words (features) the XGBoost model used to decide whether a news article is fake or real. The number next to each word indicates how much that word helped the model make accurate predictions.

- "reuters" has the highest importance score (~1732), making it the most influential word for distinguishing fake from real news in this dataset.
- Other words like "filessupport", "fact", "century", and "aruba" also contribute, but far less than "reuters".
- The remaining words have gradually decreasing importance values; they still help the model but are less critical.
- Importance here is measured using gain: how much each feature improves the model's accuracy when used in splits of the decision trees.

In summary, words like "reuters" play a major role in the model's decisions, while many other words have smaller but still valuable impacts. These important words help the model recognize patterns that separate fake news from real news.
✅ Common Features (Appear in All 3 Domains)

These features are important across News, Other, and Politics:

- reuters
- said
- via
- news
- image
- obama
- know
- donald
- report
- american
- washington
- leader
- agency
- lawmaker

(Total: 14 common features)

❌ Unique Features per Domain

📰 News-only
- minister

📦 Other-only
- century
- filessupport
- last

🏛️ Politics-only
- None (all features in Politics appear in at least one other domain)

🔍 Notes

- “reuters” is by far the most influential feature across all domains (SHAP > 5).
- Features like “minister”, “century”, and “filessupport” provide domain-specific signals.
- Common features such as “said”, “via”, “obama”, and “report” likely indicate general patterns in news language.
## Note on Hyperparameter Tuning Due to time constraints, full hyperparameter tuning was not completed. However, incorporating a thorough hyperparameter optimization process (e.g., using RandomizedSearchCV or Optuna) has the potential to significantly improve the model's performance by finding the optimal combination of parameters.